Lab 1 - Exploring Table Data¶

In this document, we are considering the Stroke Prediction dataset. It was collected to help predict a patient's susceptibility to cerebrovascular accidents (i.e. strokes).

This dataset was sourced from kaggle and is a collection of attributes describing around 5000 medical patients.

Team 👩‍🔬👨‍🔬👨‍🔬¶

Conducting this analysis is a team of three:

  1. Samina Faheem
  2. Giancarlos Dominguez
  3. Kassi Bertrand

Stakeholders & Interests 💼¶

The benefits in accurately predicting this information will be helpful to Neurologists in detecting patterns in their diagnosis and initiating preventive care for patients when needed.

According to the World Health Organization(WHO), strokes are the 2nd leading cause of death globally, contributing to 11% of total deaths. Moreover, WebMD reveals that the average time period elapsed between a stroke patient entering the emergency room and a doctor reviewing the CT results and initiating treatment is 45 minutes.

Our algorithm has the potential to not only reduce the number of stroke deaths, but also reduce the number of 911 calls related to strokes and response time of medical professionals.

Since our algorithm is intented for day-to-day use in Neurology departments, it must be both accurate and precise. In other words, our predication algorithm must consistently produce accurate results.

Our target accuracy for this algorithm is: 98%.

Loading the Stroke Dataset¶

We begin our exploration phase with loading the dataset in a Pandas dataframe.

In [1]:
import pandas as pd
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_columns', 7)

#Load Stroke dataset from filesystem into Pandas dataframe.
stroke_df = pd.read_csv('./stroke-dataset.csv')
print(stroke_df.head())
      id  gender   age  ...   bmi   smoking_status stroke
0   9046    Male  67.0  ...  36.6  formerly smoked      1
1  51676  Female  61.0  ...   NaN     never smoked      1
2  31112    Male  80.0  ...  32.5     never smoked      1
3  60182  Female  49.0  ...  34.4           smokes      1
4   1665  Female  79.0  ...  24.0     never smoked      1

[5 rows x 12 columns]

With our dataset loaded, let's print out the total number of observations:

In [2]:
print(stroke_df.shape[0])
5110

Let's now list the names and the types of the features in the dataset.

In [3]:
print(stroke_df.dtypes)
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

We have a total of $11$ features, excluding the id, which is a unique identifier for patients in this dataset. Among the relevant features are:

  • stroke: Indicates whether the patient had a stroke (1) or not (0).
  • age: Indicates the age of the patient.
  • hypertension: Indicates whether the patient has hypertension(1) or not (0).
  • heart_disease: Indicates whether the patient has a heart condition(1) or not (0).
  • bmi: Indicates the Body Mass Index of the patient.
  • smoking_status: Indicates the smoking status of the patient.
  • avg_glucose_level: Indicates the level of glucose in the patient's bloodstream.

We can also look at the data summary:

In [4]:
stroke_df.describe()
Out[4]:
id age hypertension heart_disease avg_glucose_level bmi stroke
count 5110.000000 5110.000000 5110.000000 5110.000000 5110.000000 4909.000000 5110.000000
mean 36517.829354 43.226614 0.097456 0.054012 106.147677 28.893237 0.048728
std 21161.721625 22.612647 0.296607 0.226063 45.283560 7.854067 0.215320
min 67.000000 0.080000 0.000000 0.000000 55.120000 10.300000 0.000000
25% 17741.250000 25.000000 0.000000 0.000000 77.245000 23.500000 0.000000
50% 36932.000000 45.000000 0.000000 0.000000 91.885000 28.100000 0.000000
75% 54682.000000 61.000000 0.000000 0.000000 114.090000 33.100000 0.000000
max 72940.000000 82.000000 1.000000 1.000000 271.740000 97.600000 1.000000

Cleaning the Dataset¶

Before we trying to "clean" our dataset, we must count the number of missing values in the entire dataset.

In [5]:
stroke_df.isnull().sum()
Out[5]:
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

It appears that only the bmi column is missing values. We decided to print the ids of the rows containing missing data to identify them:

In [6]:
row_missing = [row for row in stroke_df.index if stroke_df.iloc[row].isnull().any()]
print(row_missing)
[1, 8, 13, 19, 27, 29, 43, 46, 50, 51, 54, 57, 64, 70, 78, 81, 84, 105, 112, 124, 126, 129, 133, 146, 150, 160, 161, 162, 167, 170, 171, 174, 178, 183, 189, 198, 200, 218, 227, 247, 342, 360, 432, 477, 479, 522, 668, 671, 680, 729, 742, 865, 867, 872, 879, 903, 936, 965, 1102, 1106, 1115, 1183, 1194, 1214, 1235, 1241, 1277, 1293, 1300, 1306, 1324, 1342, 1352, 1427, 1457, 1466, 1468, 1471, 1503, 1528, 1539, 1546, 1596, 1640, 1644, 1646, 1650, 1669, 1670, 1681, 1718, 1719, 1730, 1753, 1756, 1779, 1816, 1836, 1837, 1866, 1894, 1906, 1912, 1981, 1993, 2030, 2103, 2105, 2109, 2192, 2215, 2263, 2285, 2321, 2322, 2339, 2343, 2477, 2494, 2502, 2515, 2529, 2532, 2541, 2582, 2697, 2739, 2752, 2768, 2788, 2816, 2828, 2855, 2867, 2879, 2897, 2914, 2960, 2997, 3007, 3028, 3048, 3059, 3074, 3104, 3111, 3135, 3161, 3162, 3164, 3176, 3197, 3214, 3215, 3216, 3375, 3382, 3425, 3431, 3503, 3562, 3605, 3629, 3681, 3699, 3705, 3726, 3734, 3802, 3808, 3872, 3913, 3940, 3945, 3951, 4046, 4069, 4164, 4202, 4230, 4255, 4283, 4286, 4422, 4451, 4522, 4561, 4616, 4684, 4713, 4750, 4790, 4921, 4934, 4949, 4984, 5039, 5048, 5093, 5099, 5105]

To give us a visuals of the missing data, we used the missingno library. We helped ourselves with Dr. Larson's following code snippet, but adapted for our dataset:

In [7]:
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

import missingno

missingno.matrix(stroke_df)
plt.title("Not sorted", fontsize=22)

plt.figure()
missingno.matrix(stroke_df.sort_values(by=["bmi"]))
plt.title("Sorted",fontsize=22)
plt.show()
<Figure size 640x480 with 0 Axes>

As we can see, only a few values from the bmi column are missing. Let's find out the percentage of missing values in the bmi column.

In [8]:
bmi_col = stroke_df["bmi"].isnull()
percentage = (len(bmi_col[bmi_col == True]) / stroke_df.shape[0]) * 100
print(f'{percentage:.2f} %')
3.93 %

Just about 4% of the BMI values are missing.

There could be several reasons for missing BMI values in a dataset. One of the common reasons is incompleteness in the data collection process, where some individuals did not provide the required information or their data was not recorded accurately. Additionally, some individuals might have declined to provide their BMI information for privacy or personal reasons. There could also be cases where the individuals did not have their BMI measured or recorded due to medical conditions or other restrictions. Another reason could be due to data entry errors, where the values were incorrectly entered or omitted during data processing. Other reasons could include that data may be missing due to technical difficulties or software glitches during data storage and retrieval.

We checked for duplicates in the data set, we can see that there are no duplicates in the data

In [9]:
stroke_df.duplicated().sum()
Out[9]:
0


It is important to identify that the patient ID is not relevant when visualizing the data, as it contains no useful information. The patient ID is used to identify individual patients and is not used when looking at general trends within the data. Therefore, it is best to drop the patient ID when visualizing the data.

In [10]:
if 'id' in stroke_df:
    del stroke_df['id']

The presence or absence of marriage may or may not have an impact on the likelihood of a person having a stroke. However, the relationship between marriage and stroke is likely to be confounded by many other factors such as age, health behaviors, and medical history. Additionally, including the marriage column in a stroke dataset may not provide enough information to determine causality.

In [11]:
if 'ever_married' in stroke_df:
    del stroke_df['ever_married']

To avoid droping the rows with missing bmi values, as they may contain important information, the team decided to impute the bmi values using K-Nearest Neighbor Imputation (KNN). We chose to go with this imputation technique because the distribution of the imputed values closely matches the distribution of the original dataset.

Once again, we'll adapt Dr. Larson's following code snippet, to our dataset:

In [12]:
# impute based upon the K closest samples (rows)
from sklearn.impute import KNNImputer
import copy

# get object for imputation
knn_obj = KNNImputer(n_neighbors=3)

features_to_use = ['age','hypertension','heart_disease','avg_glucose_level', 'bmi','stroke']

# create a numpy matrix from pandas numeric values to impute
temp = stroke_df[features_to_use].to_numpy()

# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
#    could have also done:
# temp_imputed = knn_obj.fit_transform(temp)

# this is VERY IMPORTANT, make a deep copy, not just a reference to the object
# otherwise both data frames will be manipulated
df_imputed = copy.deepcopy(stroke_df) # not just an alias
df_imputed[features_to_use] = temp_imputed
df_imputed.dropna(inplace=True)
df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   float64
 3   heart_disease      5110 non-null   float64
 4   work_type          5110 non-null   object 
 5   Residence_type     5110 non-null   object 
 6   avg_glucose_level  5110 non-null   float64
 7   bmi                5110 non-null   float64
 8   smoking_status     5110 non-null   object 
 9   stroke             5110 non-null   float64
dtypes: float64(6), object(4)
memory usage: 399.3+ KB

Data Visualization 📊¶

In [13]:
# import libraries
import matplotlib.pyplot as plt
import warnings
import seaborn as sns

warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

Since we hope to perfect our algorithm of predicting if a patient is at risk of suffering from a stroke, we should first discover how many people have reported having a stroke, and analyze the features that this dataset covers.

In [14]:
# barplot for number of patients that report or did not suffer a stroke
plt.style.use('ggplot')
fig = plt.figure(figsize=(13,5))

sns.barplot(df_imputed.stroke.value_counts().index, df_imputed.stroke.value_counts().values)
plt.title('Number of Patients that Suffered a Stroke')
plt.ylabel('Number of Patients')
plt.xlabel('Stroke')

plt.show()
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(
In [15]:
# print numbers and percentages of patients that reported or didnt report having a stroke
print(f'Number of patients who did not suffer a stroke: {df_imputed.stroke.value_counts()[0]:.0f}')
print(f'Number of patients who suffered a stroke: {df_imputed.stroke.value_counts()[1]:.0f}\n')

print(f'Percentage of patients who did not suffer a stroke: {df_imputed.stroke.value_counts()[0] / df_imputed.shape[0] * 100:.2f}%')
print(f'Percentage of patients who suffered a stroke: {df_imputed.stroke.value_counts()[1] / df_imputed.shape[0] * 100:.2f}%')
Number of patients who did not suffer a stroke: 4861
Number of patients who suffered a stroke: 249

Percentage of patients who did not suffer a stroke: 95.13%
Percentage of patients who suffered a stroke: 4.87%

The number of people who suffered a stroke is a very small subset of the overall dataset. However, it is still worth researching all the features within the dataset and how they influence the probability of a patient suffering a stroke.

Age¶

Age is a critical factor in stroke incidence and outcomes, as it provides valuable information for the understanding, prevention, and treatment of strokes.

In [16]:
# Histogram and density plot of age
plt.style.use('ggplot')

fig = plt.figure(figsize=(20,6))

plt.plot(1,3,1)
sns.histplot(df_imputed.age, kde = True)
plt.title("Distribution of Age among Patients")
plt.xlabel("Age")

plt.show()

It should be noted that the only bins that cross the 350 threshold is at age 80 and 40. The following highest bins are clustered just before age 60, the same location where the highest peak in the density plot occurs. The lowest bin is slightly above the 150 patient threshold. It seems as if the majority of the patients are older, though it is a steady escalation as we go across the x-axis from left to right, before falling after age 60 and ending with the huge number of patients at age 80.

Although the previous graph give us the distribution of individual patients, we can also categorize them based on their age.
Lets split the patients by age ranges:
0 - 3: Toddler
4 - 12: Child
13 - 19: Teen
20 - 65: Adult
65 and above: Elder

In [17]:
# Create age range variable
df_imputed['age_range'] = pd.cut(df_imputed['age'],
                                [0,4,13,20,35,65],
                                labels=['toddler','child','teen','adult','elder'])

ax = sns.displot(df_imputed.age_range)
ax.set(title="Number of Patients in each age range", xlabel="Age Range", ylabel="Number of Patients")

plt.show()
In [18]:
# print number of patients in each age range
print(f'Number of patients in the toddler age range: {df_imputed.age_range.value_counts()[4]:.0f}')
print(f'Number of patients in the child age range: {df_imputed.age_range.value_counts()[2]:.0f}')
print(f'Number of patients in the teen age range: {df_imputed.age_range.value_counts()[3]:.0f}')
print(f'Number of patients in the adult age range: {df_imputed.age_range.value_counts()[1]:.0f}')
print(f'Number of patients in the elder age range: {df_imputed.age_range.value_counts()[0]:.0f}')
Number of patients in the toddler age range: 255
Number of patients in the child age range: 390
Number of patients in the teen age range: 380
Number of patients in the adult age range: 874
Number of patients in the elder age range: 2246

The graph and number calculation further shows how the majority of patients in this dataset are in the elder range, followed by adults, children, teens, and toddlers in that order, respectively.

In [19]:
# Group bar chart of age range and stroke
plt.style.use("ggplot")
plt.figure(figsize=(20,6))

plt.title('Relationship between Age Range and Stroke')
plt.xlabel("Number of people who had a stroke")

df_i = df_imputed.groupby(by=['age_range','stroke'])

st_count = df_i.stroke.count()

ax = st_count.plot(kind='barh')
plt.show()

display(st_count)
age_range  stroke
toddler    0.0        254
           1.0          1
child      0.0        390
           1.0          0
teen       0.0        379
           1.0          1
adult      0.0        873
           1.0          1
elder      0.0       2156
           1.0         90
Name: stroke, dtype: int64

We can observe from the grouped bar graph that the majority of patients who had a stroke are from the elder age range. Unfortunately, there is little to no data on patients who have suffered a stroke from other age ranges. Although it is true that the elder population is highly suspectable to strokes, more data must be pursued in order to investigate the unusual circumstances of younger patients suffering strokes, if any.

BMI¶

BMI (body mass index) is very be important in determining whether a patient will suffer a stroke due to affecting other various health conditions such as hypertension, which is known to increase the risk of stroke. By including BMI in our research, we can better understand the relationship between BMI and stroke and use this information to identify and target populations at high risk for stroke.

In [20]:
# Histogram and density plot of BMI
plt.style.use('ggplot')
fig = plt.figure(figsize=(20,6))

plt.plot(1,3,1)
sns.histplot(df_imputed.bmi, kde = True)
plt.title("BMI Distribution of Age among Patients")
plt.xlabel("BMI")

plt.show()

The histogram and density plot align with each other almost perfectly, with the peak concentrating around 30. From this unimodal graph, it can be stated that a majority of the patients are considered overweight, a factor that can affect whether someone suffers a stroke. In the histogram, the highest bin cross the 350 patient threshold, while its minimum is located around 60 in the x-axis.

According to the National Health Service website, there are four categories for BMI values:
0 - 18: Underweight
18 - 24.9: Healthy
25 - 29.9: Overweight
30 and above: Obese

Let's try splitting the patients into categories based on their BMI value.

In [21]:
# create bmi index variable
df_imputed['bmi_range'] = pd.cut(df_imputed['bmi'], [0,18.5,24.9,29.9,30], labels=['underweight','healthy','overweight','obese'])

# Bar chart of bmi index
ax = sns.displot(df_imputed.bmi_range)
ax.set(title="Number of Patients in each BMI Range", xlabel="BMI Index", ylabel="Number of Patients")

plt.show()
In [22]:
# print number of patients in each bmi range as table
print(f'Number of patients in the underweight bmi range: {df_imputed.bmi_range.value_counts()[2]:.0f}')
print(f'Number of patients in the healthy bmi range: {df_imputed.bmi_range.value_counts()[1]:.0f}')
print(f'Number of patients in the overweight bmi range: {df_imputed.bmi_range.value_counts()[0]:.0f}')
print(f'Number of patients in the obese bmi range: {df_imputed.bmi_range.value_counts()[3]:.0f}')
Number of patients in the underweight bmi range: 353
Number of patients in the healthy bmi range: 1251
Number of patients in the overweight bmi range: 1482
Number of patients in the obese bmi range: 30

From the bar graph, we can observe that the majority of patients are considered overweight, followed by the healthy, underweight, and obese category in that order, respectively.

In [23]:
# Group bar chart of bmi index and stroke
plt.style.use("ggplot")
plt.figure(figsize=(20,10))

plt.title("Relationship between BMI Range and Stroke")
plt.xlabel("Number of Patients")

df_im = df_imputed.groupby(by=['bmi_range','stroke'])

st_count = df_im.stroke.count()

ax = st_count.plot(kind='barh')
plt.show()

display(st_count)
bmi_range    stroke
underweight  0.0        351
             1.0          2
healthy      0.0       1214
             1.0         37
overweight   0.0       1387
             1.0         95
obese        0.0         27
             1.0          3
Name: stroke, dtype: int64

The majority of patients who have suffered a stroke are in the overweight category. Surprisingly, the second highest number of patients who've suffered strokes come from the healthy category. It can be concluded that thought BMI alone is a strong factor, the presence of a high BMI allows for other factors to affect the probability of a patient suffering from a stroke as well.

Average Glucose Level¶

High glucose levels have been observed in countless cases of patients who have suffered a stroke. By exploring this feature, we hope to find the strength of the relationship between the average glucose level of a patient and the probabilities of suffering a stroke when looking at a patient's medical history.

In [24]:
# plot of average glucose level and stroke
plt.style.use('ggplot')
fig = plt.figure(figsize=(20,6))

plt.plot(1,3,1)
sns.histplot(df_imputed.avg_glucose_level, kde = True)
plt.title("Distribution of Average Glucose Levels among Patients")
plt.xlabel("Average Glucose Level")

plt.show()

The histogram and bimodal density plot align with each other evenly. Both graphs are skewed towards the right, where the majority of patients below 140 mg/dl, the healthy limit. The histogram and density plot falls to around 50 patients for the rest of the graph horizontally, although there is a smaller peak at around 200 - 220 mg/dl.

Let us categorize the average glucose level values for visual simplicity.

In [25]:
# create bmi index variable
df_imputed['avg_gl_range'] = pd.cut(df_imputed['avg_glucose_level'], [0,140,199,300], labels=['normal','pre-diabetic','diabetic'])

# Bar chart of bmi index
ax = sns.displot(df_imputed.avg_gl_range)
ax.set(title="Number of Patients in each Glucose Level Range", xlabel="Average Glucose Level Range", ylabel="Number of Patients")

plt.show()
In [26]:
# print number of patients in each glucose level range as table
print(f'Number of patients in the normal glucose level range: {df_imputed.avg_gl_range.value_counts()[0]:.0f}')
print(f'Number of patients in the pre-diabetic glucose level range: {df_imputed.avg_gl_range.value_counts()[2]:.0f}')
print(f'Number of patients in the diabetic glucose level range: {df_imputed.avg_gl_range.value_counts()[1]:.0f}')
Number of patients in the normal glucose level range: 4289
Number of patients in the pre-diabetic glucose level range: 376
Number of patients in the diabetic glucose level range: 445
In [27]:
# Group bar chart of bmi index and stroke
plt.style.use("ggplot")
plt.figure(figsize=(20,10))

plt.title("Relationship between Average Glucose Level Range and Stroke")
plt.xlabel("Number of Patients")

df_im = df_imputed.groupby(by=['avg_gl_range','stroke'])

st_count = df_im.stroke.count()

ax = st_count.plot(kind='barh')
plt.show()

It is interesting to observe how the largest number of stroke patients have healthy average glucose levels, though it is possible to explain it through the large number of patients who have healthy levels of average glucose in their bloodstreeam. The second highest number of stroke patients are diabetic, followed by pre-diabetic.

Though the average glucose level of a person can affect whether or not they suffer from a stroke, due to the imbalance of healthy individuals and unhealthy individuals in the aspect of their average glucose levels alone, we cannot concretely say that the average glucose level is a strong indicator by itself.

Gender¶

Gender can play as a factor in whether a patient is at risk of suffering a stroke. Whether it is a strong enough factor on its own or must be in conjuction with other circumstances will be determined through analysis.

In [28]:
# bar graph of gender and stroke
plt.style.use('ggplot')
fig = plt.figure(figsize=(15,5))

pd.crosstab([df_imputed['gender']],df_imputed.stroke.astype(int)).plot(kind='bar', stacked=False) 
plt.title("Gender of Patients with and without Stroke")
plt.ylabel("Number of Patients")

plt.show()
<Figure size 1500x500 with 0 Axes>
In [29]:
# table for gender and stroke
freq_table = pd.DataFrame(pd.crosstab(df_imputed['gender'], df_imputed['stroke']))
display(freq_table)
stroke 0.0 1.0
gender
Female 2853 141
Male 2007 108
Other 1 0

The difference in female to male patients who did not suffer a stroke is 846, while the difference in female to male patients who suffered a stroke is 33.

In [30]:
sns.boxplot(x="gender", y="age", hue="stroke", data=df_imputed)
plt.title('Relationship between Gender and Stroke')

plt.show()

From the boxplot, we can observe in the red box plots how the 1st quartile of the male boxplot is lower than the first quartile of the female boxplot. The previous observation is reversed when we observe the blue boxplots. This could indicate how females are more likely to suffer a stroke at a younger age, especially as we notice how the minimum is smaller for the female boxplot, and we have two outliers in the female boxplot.

The blue box plots allows us to reaffirm that patients who suffered a stroke tend to be older.

Though gender can affect the probability of suffering a stroke, alone, it is not a strong indicator.

Hypertension and Heart Disease¶

It is well known that certain medical conditions can increase the likelihood of a stroke. This dataset focuses on two conditions: Hypertension and Heart Disease.

In [31]:
plt.style.use('ggplot')
fig = plt.figure(figsize=(20,8))

# violin plot for hypertension
plt.subplot(1,2,1)
ax1= sns.violinplot(x="hypertension", y="age", hue="stroke", data=df_imputed, 
            split=True, # split across violins
            inner="quart", # show innner stats like mena, IQR, 
            scale="count") # scale the size of the plot by the count within each group

ax1.set(title="Age Distribution of Patients with and without Hypertension", xlabel="Hypertension", ylabel="Age")

# violin plot for heart disease
plt.subplot(1,2,2)
ax2 = sns.violinplot(x="heart_disease", y="age", hue="stroke", data=df_imputed, 
            split=True, # split across violins
            inner="quart", # show innner stats like mena, IQR, 
            scale="count") # scale the size of the plot by the count within each group

ax2.set(title="Age Distribution of Patients with and without Heart Disease", xlabel="Heart Disease", ylabel="Age")

plt.show()

Across all violin plots, we observe that the majority of patients did not suffer a stroke as we declared in the beginning of the data visualization section. An interesting observation is that for both violin plots where patients have the medical condition, the blue side of the plot is located on the higher end of the age axis. Another observation is that when we compare the blue sides of the violin plots where patients have the medical condition, the heart disease violin plot has a higher peak than the hypertension violin plot. Let us look at the numbers closely again.

In [32]:
# table for hypertension and stroke
freq_table1 = pd.DataFrame(pd.crosstab(df_imputed['hypertension'], df_imputed['stroke']))
display(freq_table1)

# table for heart disease and stroke
freq_table2 = pd.DataFrame(pd.crosstab(df_imputed['heart_disease'], df_imputed['stroke']))
display(freq_table2)
stroke 0.0 1.0
hypertension
0.0 4429 183
1.0 432 66
stroke 0.0 1.0
heart_disease
0.0 4632 202
1.0 229 47

It is interesting to note how the number of patients who had both the medical condition and stroke is greater than the number of patients who had the medical condition but did not suffer a stroke. Also, the number of patients who had heart disease and a stroke is greater than the number of patients who had hypertension and a stroke, although the different is not significant. It raises the question of whether heart disease is more influencial than hypertension when determining whether a patient will suffer a stroke, but it is not possible to find the answer this the size of this dataset, as more data is needed.

Residence and Work type¶

Not only must we consider medical factor, but also environmental factors as well, since elements surrounding their everyday life can influence whether a patient will suffer from a stroke. This dataset considers a patient's residence type and work type.

In [33]:
# group bargraph for residence type and stroke
plt.style.use('ggplot')
plt.figure(figsize=(20,10))

plt.title("Relationship between residence type and stroke")
plt.xlabel("Number of Patients")

df_im = df_imputed.groupby(by=['Residence_type','stroke'])

st_count = df_im.stroke.count()
ax = st_count.plot(kind='barh')

plt.show()
In [34]:
sns.boxplot(x="Residence_type", y="age", hue="stroke", data=df_imputed)
plt.title('Relationship between the Residence Type and Stroke')

plt.show()

If we look at the red box plots first, they appear exactly the same. Looking at the blue box plots however, there are some differences. The age range for both box plots are the same, but the minimum is higher for patients living in rural residences. However, though both have a single outlier, the urban outlier is far lower than the rural outlier. We can make a claim that people living in rural residences are more likely to suffer a stroke, but the relationship between residence type and stroke is minimal.

Let's look at the numbers closely.

In [35]:
freq_table1 = pd.DataFrame(pd.crosstab(df_imputed['Residence_type'], df_imputed['stroke']))
display(freq_table1)
stroke 0.0 1.0
Residence_type
Rural 2400 114
Urban 2461 135

There are more stroke patients from urban residences than rural, but not by significant margins. Let's focus on work type now.

In [36]:
plt.style.use('ggplot')
plt.figure(figsize=(20,10))

# group bargraph for work type and stroke
plt.title("Relationship between work type and stroke")
plt.xlabel("Number of Patients")

df_im = df_imputed.groupby(by=['work_type','stroke'])

st_count = df_im.stroke.count()
ax = st_count.plot(kind='barh')

plt.show()
In [37]:
sns.boxplot(x="work_type", y="age", hue="stroke", data=df_imputed)
plt.title('Stroke')

plt.show()

We can observe that the private work type has the highest number of patients for those who have not suffered a stroke and those who have suffered a stroke. Focusing in patients who have suffered a stroke, after the private work type are those who are self employed, work government jobs, and children. With patients who have not suffered a stroke, the majority of patients reported having a private work type job, followed by children, self employed, government jobs, then those who have never worked, in that order respecitvely.

As expected, the boxplots for children is in the lower age range, though we can observe that the maximum is in the boxplot for children who did not suffer a stroke. In the boxplot depicting patients who have never worked, their age range is also quite low, and notice how they are no reports of having a stroke.

Looking at the rest of the other boxplots, the private work type of those who suffered a stroke has the lowest miminum by a large margin. Additionally, there quite a lot of outliers in the blue boxplot of patients who are self-employed.

Perhaps the work type can affect the probability of someone suffering a stroke, but then the categorical labels do not provide enough information for us to determine that exactly from that work type increases the probability of someone suffering a stroke.

Smoking Status¶

A patients smoking status can increase the likelihood of other medical conditions, but we will analyze how strong the correlation is between a patient's smoking status and the probability of suffering a stroke.

In [38]:
# plot of Age and Smoking status
plt.style.use("ggplot")
plt.subplots(figsize=(15, 10))

#int vs categ
sns.swarmplot(x="smoking_status", y="age", hue="stroke", data=df_imputed)

plt.show()
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 12.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

It is interesting to note how the majority of patients who have suffered a stroke are above age 40. We cannot determine which smoking status category has the greatest number of stroke patients, though we can observe how a majority of patients have never smoked due to the width of the particular swarmplot.

We used age along with the smoking status of a patient to try to determine which smoking status correlates the strongest with suffering a stroke.
Let us now use BMI.

In [39]:
# Swarm plot of BMI and smoking status
plt.style.use("ggplot")
plt.subplots(figsize=(15, 10))
plt.title("Swarm Plot of Smoking Status and Age with Stroke as hue")

sns.swarmplot(x="smoking_status", y="bmi", hue="stroke", data=df_imputed)

plt.show()
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 17.9% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 46.5% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 12.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
C:\Users\fahee\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 37.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

It seems that there is little if any correlation between the BMI and smoking status of a patient, as the number of stroke patients is spread out randonly throughout the BMI values. It shoud be noted that the never smoked, smokes, and unknown category have some outliers with significant distance from the main body of the swarm plot, though those patients did not suffer a stroke.

Though smoking status could act as a factor, we did not find any strong correlation to the probabilities of someone suffering a stroke.

3 Interesting Questions 🧐🙋¶

In this section, the team asks three interesting questions and tries to answer them using appropriate visualization methods:

  1. Are patients who smoke 🚬 more at risk of having a stroke?

Our goal is to determine the number of patients who had a stroke according to their smoking status.

In [40]:
patient_with_stroke = len(df_imputed[ df_imputed ['stroke'] == 1.0 ])
print(f"The number of patient who suffered a stroke is {patient_with_stroke}")
The number of patient who suffered a stroke is 249
In [41]:
temp_df = df_imputed[ df_imputed ['stroke'] == 1.0 ]
temp_df.groupby(by=['smoking_status'])['stroke'].count()
Out[41]:
smoking_status
Unknown            47
formerly smoked    70
never smoked       90
smokes             42
Name: stroke, dtype: int64

We can have a better visual experience by plotting the above results using a bar graph like so:

In [42]:
plt.style.use('ggplot')
temp_df = df_imputed[ df_imputed ['stroke'] == 1.0 ]

plt.hist(temp_df['smoking_status'])
plt.ylabel('Frequency count')
plt.xlabel('Smoking Status');
plt.title('Smoking status of the individuals with a stroke')
plt.show()

Interesting!

As we can see, the majority of patients who suffered a stroke had never smoked before, or had completely stopped smoking. Consequently, the team concluded that smoking status alone is not a strong indicator.

  1. Are patients with high levels of glucose more at risk of having a stroke?

  2. How corralated are the features of our dataset?

To answer both questions, the team decided to use a correlation matrix.

In [43]:
# correlation plot
cmap = sns.set(style="darkgrid") # one of the many styles to plot using

f, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(df_imputed.corr(), cmap=cmap, annot=True)
 
f.tight_layout()

Despite not observing very strong correlations between features, it appears that the BMI and the glucose level show some level of relationship. The same can be said for the heart disease and the glucose level as well.

Observing a relationship between the glucose level, BMI, and heart disease is telling because patients with high BMIs (body weight) are more likely to develop diabetes, which is the result of an excessive amount of glucose in the bloodstream 🩸

In [44]:
new_vars_to_use = ['age','hypertension','heart_disease','avg_glucose_level','stroke', 'smoking_status']
df_imputed.smoking_status.value_counts()

sns.pairplot(df_imputed[new_vars_to_use], hue='smoking_status')
Out[44]:
<seaborn.axisgrid.PairGrid at 0x1ee0f7bf790>

UMAP¶

We refered to the website: https://umap-learn.readthedocs.io/en/latest/basic_usage.html

In [45]:
import umap
In [46]:
reducer = umap.UMAP()

Prior to dimensionality reduction, standardization is necessary to lessen the influence of large-valued variables, which otherwise would dominate the computation of distance. Standardization is stronger against outliner.

Standarization= (x-x(mean))/standard deviation

In [47]:
df_imputed.describe()
Out[47]:
age hypertension heart_disease avg_glucose_level bmi stroke
count 5110.000000 5110.000000 5110.000000 5110.000000 5110.000000 5110.000000
mean 43.226614 0.097456 0.054012 106.147677 28.949380 0.048728
std 22.612647 0.296607 0.226063 45.283560 7.783078 0.215320
min 0.080000 0.000000 0.000000 55.120000 10.300000 0.000000
25% 25.000000 0.000000 0.000000 77.245000 23.700000 0.000000
50% 45.000000 0.000000 0.000000 91.885000 28.200000 0.000000
75% 61.000000 0.000000 0.000000 114.090000 33.100000 0.000000
max 82.000000 1.000000 1.000000 271.740000 97.600000 1.000000

We can see that the stroke is not the int, so we convert it into ints to futher analyze the data.

In [48]:
df_imputed['stroke'] = df_imputed['stroke'].astype(int)
In [49]:
df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   gender             5110 non-null   object  
 1   age                5110 non-null   float64 
 2   hypertension       5110 non-null   float64 
 3   heart_disease      5110 non-null   float64 
 4   work_type          5110 non-null   object  
 5   Residence_type     5110 non-null   object  
 6   avg_glucose_level  5110 non-null   float64 
 7   bmi                5110 non-null   float64 
 8   smoking_status     5110 non-null   object  
 9   stroke             5110 non-null   int32   
 10  age_range          4145 non-null   category
 11  bmi_range          3116 non-null   category
 12  avg_gl_range       5110 non-null   category
dtypes: category(3), float64(5), int32(1), object(4)
memory usage: 394.9+ KB
In [50]:
df_imputed['stroke'] = df_imputed['stroke'].astype("category")
In [51]:
df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   gender             5110 non-null   object  
 1   age                5110 non-null   float64 
 2   hypertension       5110 non-null   float64 
 3   heart_disease      5110 non-null   float64 
 4   work_type          5110 non-null   object  
 5   Residence_type     5110 non-null   object  
 6   avg_glucose_level  5110 non-null   float64 
 7   bmi                5110 non-null   float64 
 8   smoking_status     5110 non-null   object  
 9   stroke             5110 non-null   category
 10  age_range          4145 non-null   category
 11  bmi_range          3116 non-null   category
 12  avg_gl_range       5110 non-null   category
dtypes: category(4), float64(5), object(4)
memory usage: 380.0+ KB
In [52]:
df_imputed['gender'] = df_imputed['gender'].astype("category")
In [53]:
df_imputed['hypertension'] = df_imputed['hypertension'].astype("category")
In [54]:
df_imputed['heart_disease'] = df_imputed['heart_disease'].astype("category")
In [55]:
df_imputed['work_type'] = df_imputed['work_type'].astype("category")
In [56]:
df_imputed['Residence_type'] = df_imputed['Residence_type'].astype("category")
In [57]:
df_imputed['smoking_status'] = df_imputed['smoking_status'].astype("category")

We categorized each variable, Converted the stroke to int, so it would be helpful to plot on a umap.

In [58]:
from sklearn.preprocessing import StandardScaler
df_select_feature = df_imputed.select_dtypes(include='number')
In [59]:
type(df_select_feature)
Out[59]:
pandas.core.frame.DataFrame
In [60]:
df_select_feature=df_select_feature.to_numpy()
In [61]:
df_select_feature
Out[61]:
array([[ 67.        , 228.69      ,  36.6       ],
       [ 61.        , 202.21      ,  30.86666667],
       [ 80.        , 105.92      ,  32.5       ],
       ...,
       [ 35.        ,  82.99      ,  30.6       ],
       [ 51.        , 166.29      ,  25.6       ],
       [ 44.        ,  85.28      ,  26.2       ]])
In [62]:
%matplotlib inline
In [63]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})
In [64]:
scaled_data = StandardScaler().fit_transform(df_select_feature)
In [65]:
embedding = reducer.fit_transform(scaled_data)
embedding.shape
Out[65]:
(5110, 2)
In [66]:
# we plotted the embedded data ugainst the smoking status to predict data
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_imputed.smoking_status.map({"formerly smoked":0, "never smoked":1, "smokes":2, 
                                                            "Unknown":3})])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Stoke dataset', fontsize=24);
In [67]:
df_imputed.gender.unique()
Out[67]:
['Male', 'Female', 'Other']
Categories (3, object): ['Female', 'Male', 'Other']
In [68]:
#we then tried to plot the Gender with the embedded values
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_imputed.gender.map({"Male":0, "Female":1, "Other":2})])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Stoke dataset', fontsize=24);

UMAP 1

In [69]:
import numpy as np
In [70]:
plt.scatter(embedding[:, 0], embedding[:, 1], c=df_imputed.stroke, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the heart stroke dataset', fontsize=24);

UMAP 2

In [71]:
df_imputed.smoking_status.cat.codes
Out[71]:
0       1
1       2
2       2
3       3
4       2
       ..
5105    2
5106    2
5107    2
5108    1
5109    0
Length: 5110, dtype: int8
In [72]:
plt.scatter(embedding[:, 0], embedding[:, 1], c=df_imputed.smoking_status.cat.codes, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the heart stroke dataset', fontsize=24);

The non-linear dimensionality reduction technique UMAP (Uniform Manifold Approximation and Projection) has been found to be particularly effective for medical datasets, notably those for stroke prediction.

UMAP can better capture the complicated non-linear correlations in medical datasets since it is more adaptable than conventional linear dimensionality reduction techniques (like PCA or t-SNE). This is significant because numerous medical datasets, like those for the prediction of strokes, contain a substantial number of variables that may be highly interdependent.

UMAP reduces the dimensionality of the data, which makes it simpler to visualize the connections between variables and spot trends or correlations that might be crucial for forecasting the likelihood of having a stroke. As it reduces the number of variables and lessens data noise, UMAP also aids in preventing overfitting and enhancing the interpretability of models.

UMAP is a strategy that shows promise since it can maintain non-linear correlations in complicated datasets.

Overall, UMAP is a promising approach for stroke prediction since it has the potential to increase the precision and interpretability of predictive models due to its capacity to preserve non-linear correlations in complicated datasets.

Referenced https://www.ijrte.org/wp-content/uploads/papers/v9i1/A2961059120.pdf for why Umap would be a promising model for medical data sets.